Link to my Detailed YouTube Video Explaining the whole Notebook
Identify whether a given file/piece of software is malware.
(The Test dataset here comes from splitting only train.7z (which is ~200GB after extraction) into Train, Test and CV sets.)
What is in this kernel
You are provided with a set of known malware files representing a mix of 9 different families. Each malware file has an Id, a 20 character hash value uniquely identifying the file, and a Class, an integer representing one of 9 family names to which the malware may belong:
1. Ramnit, 2. Lollipop, 3. Kelihos_ver3, 4. Vundo, 5. Simda, 6. Tracur, 7. Kelihos_ver1, 8. Obfuscator.ACY, 9. Gatak
For each file, the raw data contains the hexadecimal representation of the file's binary content, without the PE header (to ensure sterility). You are also provided a metadata manifest, which is a log containing various metadata information extracted from the binary, such as function calls, strings, etc. This was generated using the IDA disassembler tool. Your task is to develop the best mechanism for classifying files in the test set into their respective family affiliations.
The dataset contains the following files: train.7z, test.7z and trainLabels.csv.
Here we are provided only with raw data; no pre-extracted features are available.
Due to the large size of the dataset (500GB), I had real issues fitting the data into memory at runtime (Colab Pro failed, and I definitely could not run it in Kaggle).
In Kaggle I ran out of disk space when trying to extract just the train.7z file.
The code below started extracting, and at only around 5% progress I got the out-of-disk error.
!pip install py7zr
!python -m py7zr x full_path_of_7z_file
!python -m py7zr x /content/gdrive/MyDrive/MS_Malware_Kaggle_to_Gdrive/train.7z
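If I were to retry this, a less disk-hungry route would be py7zr's Python API, which can extract only selected members instead of the whole archive. Below is a minimal sketch (not something I actually ran; the member-name layout inside the archive is an assumption):
import py7zr

archive_path = '/content/gdrive/MyDrive/MS_Malware_Kaggle_to_Gdrive/train.7z'

# list every member name in the archive (no extraction yet)
with py7zr.SevenZipFile(archive_path, mode='r') as archive:
    names = archive.getnames()

wanted = names[:100]   # e.g. just the first 100 members

# extract only the selected members into a small sample folder
with py7zr.SevenZipFile(archive_path, mode='r') as archive:
    archive.extract(path='train_sample', targets=wanted)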
It could definitely have been done on Google Cloud or AWS, but I have not tried those options.
Therefore, I ONLY extracted train.7z (which is ~200GB after extraction) on my local machine and then split this set into Train, Test and CV sets. From this split dataset, I did my entire analysis on the train part and did the validation on the test and CV sets.
Further, to be able to accommodate it on my local machine (which is not too high-end),
I first did all my calculations, experimentation and featurization ONLY on a sample of 50 files (i.e. 50 each from the byteFiles and asmFiles) -
Only after I saw that all the featurization calculations and XGBoost ran fine on these 50 samples did I run the same notebook on the full 200GB dataset with 20,000+ files. (A sketch of carving out such a sample is shown just below.)
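Just to illustrate the sampling step, here is a minimal sketch of how a 50-file subset can be carved out of the extracted train folder (the 'train' and 'train_sample' folder names are assumptions, not the exact paths I used):
import os, random, shutil

random.seed(42)
sample_dir = 'train_sample'            # assumed destination folder for the 50-file sample
os.makedirs(sample_dir, exist_ok=True)

all_files = os.listdir('train')        # extracted train.7z contents (.asm and .bytes files)
ids = sorted({f.split('.')[0] for f in all_files})
picked = random.sample(ids, 50)        # 50 malware Ids -> 50 .asm + 50 .bytes files

for file_id in picked:
    for ext in ('.asm', '.bytes'):
        src = os.path.join('train', file_id + ext)
        if os.path.exists(src):
            shutil.copy(src, sample_dir)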
And here's my approach for calculating all the featurizations (both for the 50-file sample and the full dataset):
i.e. this includes calculating the below features on the local machine.
Added the following extra features.
After merging all the above features, I wrote the resulting merged dataframe out to a .csv file (i.e. with the regular to_csv() function).
This .csv file with the final merged dataset was just about 170MB, so I uploaded it to Google Drive.
Then, from Colab, I simply imported that same final merged .csv file into a pandas dataframe and did the train/test/CV split on it.
The next 2 steps then ran on Colab Pro's Tesla V100 16GB GPU:
RandomizedSearchCV for hyper-parameter tuning, and then XGBoost with the best parameters.
These final RandomizedSearchCV and XGBoost runs took only about 30 minutes. (A condensed sketch of this Colab stage is shown just below.)
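To make that workflow concrete, here is a condensed sketch of the Colab stage. The CSV file name, the 'ID' merge key and tree_method='gpu_hist' (to push XGBoost onto the V100) are assumptions for illustration; the full cell-by-cell version appears later in this notebook:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from xgboost import XGBClassifier

# load the ~170MB merged feature file that was uploaded to Google Drive
merged = pd.read_csv('/content/gdrive/MyDrive/AML_Malware/final_merged_features.csv')
y = merged['Class']
X = merged.drop(['ID', 'Class'], axis=1)

# stratified train / test / CV split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, stratify=y_train, test_size=0.20)

# hyper-parameter search, then refit with the best parameters
params = {'learning_rate': [0.01, 0.03, 0.05, 0.1],
          'n_estimators': [100, 500, 1000, 2000],
          'max_depth': [3, 5, 10]}
search = RandomizedSearchCV(XGBClassifier(tree_method='gpu_hist'),
                            param_distributions=params, n_jobs=-1, verbose=10)
search.fit(X_train, y_train)
final_clf = XGBClassifier(tree_method='gpu_hist', **search.best_params_)
final_clf.fit(X_train, y_train)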
2.1.2. Example Data Point
.asm file
.text:00401000 assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing .text:00401000 56 push esi .text:00401001 8D 44 24 08 lea eax, [esp+8] .text:00401005 50 push eax .text:00401006 8B F1 mov esi, ecx .text:00401008 E8 1C 1B 00 00 call ??0exception@std@@QAE@ABQBD@Z ; std::exception::exception(char const * const &) .text:0040100D C7 06 08 BB 42 00 mov dword ptr [esi], offset off_42BB08 .text:00401013 8B C6 mov eax, esi .text:00401015 5E pop esi .text:00401016 C2 04 00 retn 4 .text:00401016 ; --------------------------------------------------------------------------- .text:00401019 CC CC CC CC CC CC CC align 10h .text:00401020 C7 01 08 BB 42 00 mov dword ptr [ecx], offset off_42BB08 .text:00401026 E9 26 1C 00 00 jmp sub_402C51 .text:00401026 ; --------------------------------------------------------------------------- .text:0040102B CC CC CC CC CC align 10h .text:00401030 56 push esi .text:00401031 8B F1 mov esi, ecx .text:00401033 C7 06 08 BB 42 00 mov dword ptr [esi], offset off_42BB08 .text:00401039 E8 13 1C 00 00 call sub_402C51 .text:0040103E F6 44 24 08 01 test byte ptr [esp+8], 1 .text:00401043 74 09 jz short loc_40104E .text:00401045 56 push esi .text:00401046 E8 6C 1E 00 00 call ??3@YAXPAX@Z ; operator delete(void *) .text:0040104B 83 C4 04 add esp, 4 .text:0040104E .text:0040104E loc_40104E: ; CODE XREF: .text:00401043j .text:0040104E 8B C6 mov eax, esi .text:00401050 5E pop esi .text:00401051 C2 04 00 retn 4 .text:00401051 ; ---------------------------------------------------------------------------
.bytes file
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20 00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01 00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18 00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04 00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80 00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90 00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19 00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00 00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00 00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00 004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08 004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A 004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04 004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82 004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00 004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00 00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00 00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00 00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10 00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11 00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10 00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01 00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00 00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00 00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11 00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00
Source: https://www.kaggle.com/c/malware-classification#evaluation
Metric(s): Multi-class logarithmic loss (the competition's evaluation metric; a short worked example follows this list).
Objective: Predict the probability of each data-point belonging to each of the nine classes.
Constraints:
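Since the evaluation metric is multi-class log loss over the nine class probabilities, here is a tiny worked example with made-up numbers (not from the dataset), just to show how the value is computed:
import numpy as np
from sklearn.metrics import log_loss

# two toy data points, nine classes; each probability row sums to 1
y_true = [3, 7]
y_prob = np.full((2, 9), 0.05)
y_prob[0, 2] = 0.6    # class labels 1..9 map to columns 0..8
y_prob[1, 6] = 0.6
print(log_loss(y_true, y_prob, labels=list(range(1, 10))))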
%%time
%pip install -U tornado
%pip install "dask[complete]"
import warnings
warnings.filterwarnings("ignore")
import shutil
import os
import pandas as pd
import matplotlib
matplotlib.use(u'nbAgg')
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm
import pickle
from sklearn.manifold import TSNE
from sklearn import preprocessing
from multiprocessing import Process  # this is used for multiprocessing (one worker per folder)
import multiprocessing
import codecs  # this is used for file operations with explicit encodings
import random as r
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import re
from nltk.util import ngrams
from sklearn.feature_selection import SelectKBest, chi2, f_regression
import scipy.sparse
import gc
import pickle as pkl
from datetime import datetime as dt
import dask.dataframe as dd
# separating byte files and asm files
# Below is from AML Assignment file
from google.colab import drive
drive.mount('/content/gdrive')
root_path = '/content/gdrive/MyDrive/AML_Malware/Full_data/'
# root_path = '../../LARGE_Datasets/'
#separating byte files and asm files
source = 'train'
destination_1 = 'byteFiles'
destination_2 = 'asmFiles'
# we will check if the folder 'byteFiles' exists if it not there we will create a folder with the same name
if not os.path.isdir(destination_1):
os.makedirs(destination_1)
if not os.path.isdir(destination_2):
os.makedirs(destination_2)
# if we have folder called 'train' (train folder contains both .asm files and .bytes files) we will rename it 'asmFiles'
# for every file that we have in our 'asmFiles' directory we check if it is ending with .bytes, if yes we will move it to
# 'byteFiles' folder
# so by the end of this snippet we will separate all the .byte files and .asm files
if os.path.isdir(source):
data_files = os.listdir(source)
for file in data_files:
print(file)
if (file.endswith("bytes")):
shutil.move(source+'\\'+file,destination_1)
if (file.endswith("asm")):
shutil.move(source+'\\'+file,destination_2)
Y=pd.read_csv("trainLabels.csv")
total = len(Y)*1.
ax=sns.countplot(x="Class", data=Y)
for p in ax.patches:
ax.annotate('{:.1f}%'.format(100*p.get_height()/total), (p.get_x()+0.1, p.get_height()+5))
#put 11 ticks (therefore 10 steps), from 0 to the total number of rows in the dataframe
ax.yaxis.set_ticks(np.linspace(0, total, 11))
#adjust the ticklabel to the desired format, without changing the position of the ticks.
ax.set_yticklabels(map('{:.1f}%'.format, 100*ax.yaxis.get_majorticklocs()/total))
plt.show()
#file sizes of byte files
files=os.listdir('byteFiles')
filenames=Y['Id'].tolist()
class_y=Y['Class'].tolist()
class_bytes=[]
sizebytes=[]
fnames=[]
for file in files:
# print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
# os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0,
# st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
# read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
statinfo=os.stat('byteFiles/'+file)
# split the file name at '.' and take the first part of it i.e the file name
file=file.split('.')[0]
if any(file == filename for filename in filenames):
i=filenames.index(file)
class_bytes.append(class_y[i])
# converting into Mb's
sizebytes.append(statinfo.st_size/(1024.0*1024.0))
fnames.append(file)
data_size_byte=pd.DataFrame({'ID':fnames,'size':sizebytes,'Class':class_bytes})
print (data_size_byte.head())
#boxplot of byte files
ax = sns.boxplot(x="Class", y="size", data=data_size_byte)
plt.title("boxplot of .bytes file sizes")
plt.show()
#removal of address from byte files
# contents of .byte files
# ----------------
#00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08
#-------------------
#we remove the starting address 00401000
files = os.listdir('byteFiles')
filenames=[]
array=[]
for file in files:
if(file.endswith("bytes")):
file=file.split('.')[0]
text_file = open('byteFiles/'+file+".txt", 'w+')
with open('byteFiles/'+file+".bytes","r") as fp:
lines=""
for line in fp:
a=line.rstrip().split(" ")[1:]
b=' '.join(a)
b=b+"\n"
text_file.write(b)
fp.close()
os.remove('byteFiles/'+file+".bytes")
text_file.close()
files = os.listdir('byteFiles')
filenames2=[]
feature_matrix = np.zeros((len(files),257),dtype=int)
k=0
# program to convert the byte files into a bag of words (unigram) representation
# this is a custom-built unigram bag of words
# this is a custom implementation of CountVectorizer, since CountVectorizer will NOT support working over such a huge (~50GB) set of files
# the resulting uni-gram features are written to a file named 'result.csv'
byte_feature_file=open('result.csv','w+')
byte_feature_file.write("ID,0,1,2,3,4,5,6,7,8,9,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20,21,22,23,24,25,26,27,28,29,2a,2b,2c,2d,2e,2f,30,31,32,33,34,35,36,37,38,39,3a,3b,3c,3d,3e,3f,40,41,42,43,44,45,46,47,48,49,4a,4b,4c,4d,4e,4f,50,51,52,53,54,55,56,57,58,59,5a,5b,5c,5d,5e,5f,60,61,62,63,64,65,66,67,68,69,6a,6b,6c,6d,6e,6f,70,71,72,73,74,75,76,77,78,79,7a,7b,7c,7d,7e,7f,80,81,82,83,84,85,86,87,88,89,8a,8b,8c,8d,8e,8f,90,91,92,93,94,95,96,97,98,99,9a,9b,9c,9d,9e,9f,a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae,af,b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,ba,bb,bc,bd,be,bf,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,ca,cb,cc,cd,ce,cf,d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,da,db,dc,dd,de,df,e0,e1,e2,e3,e4,e5,e6,e7,e8,e9,ea,eb,ec,ed,ee,ef,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,fa,fb,fc,fd,fe,ff,??")
byte_feature_file.write("\n")
for file in files:
filenames2.append(file)
byte_feature_file.write(file+",")
if(file.endswith("txt")):
with open('byteFiles/'+file,"r") as byte_flie:
for lines in byte_flie:
line=lines.rstrip().split(" ")
for hex_code in line:
if hex_code=='??':
feature_matrix[k][256]+=1
else:
feature_matrix[k][int(hex_code,16)]+=1
byte_flie.close()
for i, row in enumerate(feature_matrix[k]):
if i!=len(feature_matrix[k])-1:
byte_feature_file.write(str(row)+",")
else:
byte_feature_file.write(str(row))
byte_feature_file.write("\n")
k += 1
byte_feature_file.close()
byte_features=pd.read_csv("result.csv")
byte_features['ID'] = byte_features['ID'].str.split('.').str[0]
byte_features.head(2)
data_size_byte.head(2)
byte_features_with_size = byte_features.merge(data_size_byte, on='ID')
byte_features_with_size.to_csv("result_with_size.csv")
byte_features_with_size.head(2)
# https://stackoverflow.com/a/29651514
def normalize(df):
result1 = df.copy()
for feature_name in df.columns:
if (str(feature_name) != str('ID') and str(feature_name)!=str('Class')):
max_value = df[feature_name].max()
min_value = df[feature_name].min()
result1[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
return result1
result = normalize(byte_features_with_size)
result.head(2)
data_y = result['Class']
result.head()
#multivariate analysis on byte files
#this is with perplexity 50
xtsne=TSNE(perplexity=50)
results=xtsne.fit_transform(result.drop(['ID','Class'], axis=1))
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
#this is with perplexity 30
xtsne=TSNE(perplexity=30)
results=xtsne.fit_transform(result.drop(['ID','Class'], axis=1))
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
data_y = result['Class']
# split the data into test and train by maintaining the same distribution of the output variable 'data_y' [stratify=data_y]
X_train, X_test, y_train, y_test = train_test_split(result.drop(['ID','Class'], axis=1), data_y,stratify=data_y,test_size=0.20)
# split the train data into train and cross validation by maintaining the same distribution of the output variable 'y_train' [stratify=y_train]
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train,stratify=y_train,test_size=0.20)
print('Number of data points in train data:', X_train.shape[0])
print('Number of data points in test data:', X_test.shape[0])
print('Number of data points in cross validation data:', X_cv.shape[0])
# value_counts() returns a Series with class labels as the index and the number of data points in that class as values
train_class_distribution = y_train.value_counts().sort_index()
test_class_distribution = y_test.value_counts().sort_index()
cv_class_distribution = y_cv.value_counts().sort_index()
my_colors = 'rgbkymc'
train_class_distribution.plot(kind='bar', color=my_colors)
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in train data')
plt.grid()
plt.show()
# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(train_class_distribution.values): the minus sign will give us in decreasing order
sorted_yi = np.argsort(-train_class_distribution.values)
for i in sorted_yi:
print('Number of data points in class', i+1, ':',train_class_distribution.values[i], '(', np.round((train_class_distribution.values[i]/y_train.shape[0]*100), 3), '%)')
print('-'*80)
my_colors = 'rgbkymc'
test_class_distribution.plot(kind='bar', color=my_colors)
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in test data')
plt.grid()
plt.show()
# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(train_class_distribution.values): the minus sign will give us in decreasing order
sorted_yi = np.argsort(-test_class_distribution.values)
for i in sorted_yi:
print('Number of data points in class', i+1, ':',test_class_distribution.values[i], '(', np.round((test_class_distribution.values[i]/y_test.shape[0]*100), 3), '%)')
print('-'*80)
my_colors = 'rgbkymc'
cv_class_distribution.plot(kind='bar', color=my_colors)
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in cross validation data')
plt.grid()
plt.show()
# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(train_class_distribution.values): the minus sign will give us in decreasing order
sorted_yi = np.argsort(-cv_class_distribution.values)
for i in sorted_yi:
print('Number of data points in class', i+1, ':',cv_class_distribution.values[i], '(', np.round((cv_class_distribution.values[i]/y_cv.shape[0]*100), 3), '%)')
def plot_confusion_matrix(test_y, predict_y):
C = confusion_matrix(test_y, predict_y)
print("Number of misclassified points ",(len(test_y)-np.trace(C))/len(test_y)*100)
# C = 9,9 matrix, each cell (i,j) represents number of points of class i are predicted class j
A =(((C.T)/(C.sum(axis=1))).T)
#divide each element of the confusion matrix by the sum of elements in that row
# C = [[1, 2],
# [3, 4]]
# C.T = [[1, 3],
# [2, 4]]
# C.sum(axis=1): axis=0 corresponds to columns and axis=1 corresponds to rows in a two-dimensional array
# C.sum(axis=1) = [[3, 7]]
# ((C.T)/(C.sum(axis=1))) = [[1/3, 3/7]
# [2/3, 4/7]]
# ((C.T)/(C.sum(axis=1))).T = [[1/3, 2/3]
# [3/7, 4/7]]
# sum of row elements = 1, so A is the recall matrix
B =(C/C.sum(axis=0))
#divide each element of the confusion matrix by the sum of elements in that column
# C = [[1, 2],
# [3, 4]]
# C.sum(axis=0): axis=0 corresponds to columns and axis=1 corresponds to rows in a two-dimensional array
# C.sum(axis=0) = [[4, 6]]
# (C/C.sum(axis=0)) = [[1/4, 2/6],
# [3/4, 4/6]]
# sum of column elements = 1, so B is the precision matrix
labels = [1,2,3,4,5,6,7,8,9]
cmap=sns.light_palette("green")
# representing C (the raw confusion matrix) in heatmap format
print("-"*50, "Confusion matrix", "-"*50)
plt.figure(figsize=(10,5))
sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()
print("-"*50, "Precision matrix", "-"*50)
plt.figure(figsize=(10,5))
sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()
print("Sum of columns in precision matrix",B.sum(axis=0))
# representing A (the recall matrix) in heatmap format
print("-"*50, "Recall matrix" , "-"*50)
plt.figure(figsize=(10,5))
sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()
print("Sum of rows in precision matrix",A.sum(axis=1))
# we need to generate 9 numbers and the sum of numbers should be 1
# one solution is to generate 9 random numbers and divide each of the numbers by their sum
# ref: https://stackoverflow.com/a/18662466/4084039
test_data_len = X_test.shape[0]
cv_data_len = X_cv.shape[0]
# we create an output array that has exactly the same size as the CV data
cv_predicted_y = np.zeros((cv_data_len,9))
for i in range(cv_data_len):
rand_probs = np.random.rand(1,9)
cv_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Cross Validation Data using Random Model",log_loss(y_cv,cv_predicted_y, eps=1e-15))
# Test-Set error.
#we create an output array that has exactly the same size as the test data
test_predicted_y = np.zeros((test_data_len,9))
for i in range(test_data_len):
rand_probs = np.random.rand(1,9)
test_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Test Data using Random Model",log_loss(y_test,test_predicted_y, eps=1e-15))
predicted_y =np.argmax(test_predicted_y, axis=1)
plot_confusion_matrix(y_test, predicted_y+1)
# find more about KNeighborsClassifier() here http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
# -------------------------
# default parameter
# KNeighborsClassifier(n_neighbors=5, weights=’uniform’, algorithm=’auto’, leaf_size=30, p=2,
# metric=’minkowski’, metric_params=None, n_jobs=1, **kwargs)
# methods of
# fit(X, y) : Fit the model using X as training data and y as target values
# predict(X):Predict the class labels for the provided data
# predict_proba(X):Return probability estimates for the test data X.
# find more about CalibratedClassifierCV here at http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
# ----------------------------
# default paramters
# sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method=’sigmoid’, cv=3)
#
# some of the methods of CalibratedClassifierCV()
# fit(X, y[, sample_weight]) Fit the calibrated model
# get_params([deep]) Get parameters for this estimator.
# predict(X) Predict the target of new samples.
# predict_proba(X) Posterior probabilities of classification
alpha = [x for x in range(1, 15, 2)]
cv_log_error_array=[]
for i in alpha:
k_clf=KNeighborsClassifier(n_neighbors=i)
k_clf.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(k_clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_cv)
cv_log_error_array.append(log_loss(y_cv, predict_y, labels=k_clf.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for k = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
k_clf=KNeighborsClassifier(n_neighbors=alpha[best_alpha])
k_clf.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(k_clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_train)
print ('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y))
predict_y = sig_clf.predict_proba(X_cv)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# ------------------------------
# default parameters
# SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate=’optimal’, eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)
# some of methods
# fit(X, y[, coef_init, intercept_init, …]) Fit linear model with Stochastic Gradient Descent.
# predict(X) Predict class labels for samples in X.
alpha = [10 ** x for x in range(-5, 4)]
cv_log_error_array=[]
for i in alpha:
logisticR=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
logisticR.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_cv)
cv_log_error_array.append(log_loss(y_cv, predict_y, labels=logisticR.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
logisticR=LogisticRegression(penalty='l2',C=alpha[best_alpha],class_weight='balanced')
logisticR.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
sig_clf.fit(X_train, y_train)
pred_y=sig_clf.predict(X_test)
predict_y = sig_clf.predict_proba(X_train)
print ('log loss for train data',log_loss(y_train, predict_y, labels=logisticR.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_cv)
print ('log loss for cv data',log_loss(y_cv, predict_y, labels=logisticR.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_test)
print ('log loss for test data',log_loss(y_test, predict_y, labels=logisticR.classes_, eps=1e-15))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
# --------------------------------
# default parameters
# sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion=’gini’, max_depth=None, min_samples_split=2,
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=’auto’, max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False,
# class_weight=None)
# Some of methods of RandomForestClassifier()
# fit(X, y[, sample_weight]) Build a forest of trees from the training set (X, y).
# predict(X) Predict class for X.
# predict_proba(X) Predict class probabilities for X.
# some of attributes of RandomForestClassifier()
# feature_importances_ : array of shape = [n_features]
# The feature importances (the higher, the more important the feature).
alpha=[10,50,100,500,1000,2000,3000]
cv_log_error_array=[]
train_log_error_array=[]
from sklearn.ensemble import RandomForestClassifier
for i in alpha:
r_clf=RandomForestClassifier(n_estimators=i,random_state=42,n_jobs=-1)
r_clf.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(r_clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_cv)
cv_log_error_array.append(log_loss(y_cv, predict_y, labels=r_clf.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
r_clf=RandomForestClassifier(n_estimators=alpha[best_alpha],random_state=42,n_jobs=-1)
r_clf.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(r_clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_train)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y))
predict_y = sig_clf.predict_proba(X_cv)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
# Training a hyper-parameter tuned XGBoost classifier on our train data
# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default paramters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of the methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance
alpha=[10,50,100,500,1000,2000]
cv_log_error_array=[]
for i in alpha:
x_clf=XGBClassifier(n_estimators=i,nthread=-1)
x_clf.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(x_clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_cv)
cv_log_error_array.append(log_loss(y_cv, predict_y, labels=x_clf.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
x_clf=XGBClassifier(n_estimators=alpha[best_alpha],nthread=-1)
x_clf.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(x_clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_train)
print ('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y))
predict_y = sig_clf.predict_proba(X_cv)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
x_clf=XGBClassifier()
prams={
'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
'n_estimators':[100,200,500,1000,2000],
'max_depth':[3,5,10],
'colsample_bytree':[0.1,0.3,0.5,1],
'subsample':[0.1,0.3,0.5,1]
}
random_clf1=RandomizedSearchCV(x_clf,param_distributions=prams,verbose=10,n_jobs=-1,)
random_clf1.fit(X_train,y_train)
print (random_clf1.best_params_)
# Training a hyper-parameter tuned XGBoost classifier on our train data
# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default paramters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of the methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance
x_clf=XGBClassifier(n_estimators=2000, learning_rate=0.05, colsample_bytree=1, max_depth=3)
x_clf.fit(X_train,y_train)
c_cfl=CalibratedClassifierCV(x_clf,method='sigmoid')
c_cfl.fit(X_train,y_train)
predict_y = c_cfl.predict_proba(X_train)
print ('train loss',log_loss(y_train, predict_y))
predict_y = c_cfl.predict_proba(X_cv)
print ('cv loss',log_loss(y_cv, predict_y))
predict_y = c_cfl.predict_proba(X_test)
print ('test loss',log_loss(y_test, predict_y))
There are 10,868 .asm files in total, and together they make up about 150 GB. The asm files contain: address, segments, opcodes, registers, function calls and APIs.
Here we extracted 52 important features from all the asm files.
We read the top solutions and hand-picked the features from those papers/videos/blogs.
Refer: https://www.kaggle.com/c/malware-classification/discussion
"Opcode" is short for operation code. These are the bytes stored in memory that the computer actually runs.
Here you will see all the opcodes that a processor supports. An assembler basically takes text and does a relatively simple conversion of it into a file of opcodes that the computer can read and run directly. Most assembly instructions translate very directly to opcodes, and often the 3 to 5 character assembler names for the instructions are themselves called opcodes. Very technically, the opcodes are the binary numbers stored in memory, and the names for them in assembler are the opcode mnemonics. Also, technically, not all of the stored binary numbers are opcodes: the opcodes tell the CPU what to do, and the numbers right after an opcode are often parameters (operands) for that instruction.
Also check this very complete table of x86 opcodes on x86asm.net, and this as well.
There is also the asmjit/asmdb project, which provides a public-domain X86/X64 instruction database in a JSON-like format.
In this thesis they used reverse engineering to extract the assembly instructions from a given executable file and chose to use only the opcodes, i.e. the part of the instruction that specifies the operation to be performed, for example "mov".
By performing statistical analysis on the datasets, a significant difference between the opcodes in malware and benign files was found. Because of this, supervised and unsupervised machine learning approaches such as artificial neural networks, support vector machines, Bayes nets, random forests, k-nearest neighbours and self-organizing maps were used to look at the sequences of these instructions. The unknown files were classified as either malware or benign depending on the presence of, and number of occurrences of, different sequences. They show that by using only opcodes without operands (the rest of the instruction), malware can be distinguished from benign files. By using a sequence length of up to four opcodes, a classification accuracy of 95.58% was achieved. (A minimal sketch of building such opcode n-grams is shown just below.)
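The notebook already imports nltk.util.ngrams, so here is a minimal sketch of how opcode sequences of length up to four could be turned into n-gram counts. The opcode list below is a toy example, not taken from any file in the dataset:
from collections import Counter
from nltk.util import ngrams

opcode_sequence = ['push', 'mov', 'call', 'mov', 'pop', 'retn']   # toy opcode sequence

ngram_counts = Counter()
for n in range(2, 5):                    # bi-grams, tri-grams and 4-grams of opcodes
    ngram_counts.update(ngrams(opcode_sequence, n))
print(ngram_counts.most_common(5))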
The below 2 blocks of code (for running Bag of Words / CountVectorizer over the 150GB of asm files while handling them in parallel using multiple cores of the machine) are taken from here.
In the first block, the roughly 150GB of asm data is randomly split into five folders of about 30GB each, so that five processes can read them in parallel.
# This code taken from https://github.com/kunwar-vikrant/Microsoft-Malware-Detection
#initially create five folders (plus an 'output' folder for the extracted features)
#first
#second
#third
#fourth
#fifth
#this code does a random split of the files into the five folders
folder_1 ='first'
folder_2 ='second'
folder_3 ='third'
folder_4 ='fourth'
folder_5 ='fifth'
folder_6 = 'output'
for i in [folder_1,folder_2,folder_3,folder_4,folder_5,folder_6]:
if not os.path.isdir(i):
os.makedirs(i)
source='train/'
files = os.listdir('train')
ID=Y['Id'].tolist()
data=list(range(0,10868))
r.shuffle(data)
count=0
for i in range(0,10868):
if i % 5==0:
shutil.move(source+files[data[i]],'first')
elif i%5==1:
shutil.move(source+files[data[i]],'second')
elif i%5 ==2:
shutil.move(source+files[data[i]],'third')
elif i%5 ==3:
shutil.move(source+files[data[i]],'fourth')
elif i%5==4:
shutil.move(source+files[data[i]],'fifth')
And in the second block below, what we are doing is:
First, collect the 52 most important keywords into the 4 variables named 'prefixes', 'opcodes', 'keywords' and 'registers'.
Then run a Bag of Words over these 52 keywords, e.g. count how many times we see the word 'HEADER', which is under the variable 'prefixes'.
So the code below is just a custom implementation of scikit-learn's simple CountVectorizer() function; we could not use scikit-learn directly here, as it will not cope with 150GB of data. (A sketch of the in-memory scikit-learn equivalent is shown just below.)
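For comparison, if the sample were small enough to fit in memory, the same fixed-vocabulary bag of words could be produced with scikit-learn directly. A sketch (using only a subset of the 52 keywords, since the default tokenizer would not pick up tokens like 'HEADER:' or '.dll'):
from sklearn.feature_extraction.text import CountVectorizer

vocab = ['jmp', 'mov', 'retf', 'push', 'pop', 'edx', 'esi', 'eax']   # subset of the 52 keywords
vectorizer = CountVectorizer(vocabulary=vocab, lowercase=False)

docs = ['push esi mov eax esi pop esi retn']    # one toy "document" per .asm file
counts = vectorizer.fit_transform(docs)         # with a fixed vocabulary, nothing new is learned
print(counts.toarray())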
# http://flint.cs.yale.edu/cs421/papers/x86-asm/asm.html
def firstprocess():
#The prefixes tells about the segments that are present in the asm files
#There are 450 segments(approx) present in all asm files.
#this prefixes are best segments that gives us best values.
#https://en.wikipedia.org/wiki/Data_segment
prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
#this are opcodes that are used to get best results
#https://en.wikipedia.org/wiki/X86_instruction_listings
opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
#best keywords that are taken from different blogs
keywords = ['.dll','std::',':dword']
#Below taken registers are general purpose registers and special registers
#All the registers which are taken are best
registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
file1=open("output\asmsmallfile.txt","w+")
files = os.listdir('first')
for f in files:
#filling the values with zeros into the arrays
prefixescount=np.zeros(len(prefixes),dtype=int)
opcodescount=np.zeros(len(opcodes),dtype=int)
keywordcount=np.zeros(len(keywords),dtype=int)
registerscount=np.zeros(len(registers),dtype=int)
features=[]
f2=f.split('.')[0]
file1.write(f2+",")
# opcodefile.write(f2+" ")  # 'opcodefile' (for raw opcode sequences) is not defined in this notebook, so this line is disabled
# https://docs.python.org/3/library/codecs.html#codecs.ignore_errors
# https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
with codecs.open('first/'+f,encoding='cp1252',errors ='replace') as fli:
for lines in fli:
# https://www.tutorialspoint.com/python3/string_rstrip.htm
line=lines.rstrip().split()
l=line[0]
#counting the prefixs in each and every line
for i in range(len(prefixes)):
if prefixes[i] in line[0]:
prefixescount[i]+=1
line=line[1:]
#counting the opcodes in each and every line
for i in range(len(opcodes)):
if any(opcodes[i]==li for li in line):
features.append(opcodes[i])
opcodescount[i]+=1
#counting registers in the line
for i in range(len(registers)):
for li in line:
# we will use registers only in 'text' and 'CODE' segments
if registers[i] in li and ('text' in l or 'CODE' in l):
registerscount[i]+=1
#counting keywords in the line
for i in range(len(keywords)):
for li in line:
if keywords[i] in li:
keywordcount[i]+=1
#pushing the values into the file after reading whole file
for prefix in prefixescount:
file1.write(str(prefix)+",")
for opcode in opcodescount:
file1.write(str(opcode)+",")
for register in registerscount:
file1.write(str(register)+",")
for key in keywordcount:
file1.write(str(key)+",")
file1.write("\n")
file1.close()
#same as above
def secondprocess():
prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
keywords = ['.dll','std::',':dword']
registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
file1=open("output\mediumasmfile.txt","w+")
files = os.listdir('second')
for f in files:
prefixescount=np.zeros(len(prefixes),dtype=int)
opcodescount=np.zeros(len(opcodes),dtype=int)
keywordcount=np.zeros(len(keywords),dtype=int)
registerscount=np.zeros(len(registers),dtype=int)
features=[]
f2=f.split('.')[0]
file1.write(f2+",")
# opcodefile.write(f2+" ")  # 'opcodefile' is not defined in this notebook, so this line is disabled
with codecs.open('second/'+f,encoding='cp1252',errors ='replace') as fli:
for lines in fli:
line=lines.rstrip().split()
l=line[0]
for i in range(len(prefixes)):
if prefixes[i] in line[0]:
prefixescount[i]+=1
line=line[1:]
for i in range(len(opcodes)):
if any(opcodes[i]==li for li in line):
features.append(opcodes[i])
opcodescount[i]+=1
for i in range(len(registers)):
for li in line:
if registers[i] in li and ('text' in l or 'CODE' in l):
registerscount[i]+=1
for i in range(len(keywords)):
for li in line:
if keywords[i] in li:
keywordcount[i]+=1
for prefix in prefixescount:
file1.write(str(prefix)+",")
for opcode in opcodescount:
file1.write(str(opcode)+",")
for register in registerscount:
file1.write(str(register)+",")
for key in keywordcount:
file1.write(str(key)+",")
file1.write("\n")
file1.close()
# same as the firstprocess() function
def thirdprocess():
prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
keywords = ['.dll','std::',':dword']
registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
file1=open("output\largeasmfile.txt","w+")
files = os.listdir('third')
for f in files:
prefixescount=np.zeros(len(prefixes),dtype=int)
opcodescount=np.zeros(len(opcodes),dtype=int)
keywordcount=np.zeros(len(keywords),dtype=int)
registerscount=np.zeros(len(registers),dtype=int)
features=[]
f2=f.split('.')[0]
file1.write(f2+",")
# opcodefile.write(f2+" ")  # 'opcodefile' is not defined in this notebook, so this line is disabled
with codecs.open('third/'+f,encoding='cp1252',errors ='replace') as fli:
for lines in fli:
line=lines.rstrip().split()
l=line[0]
for i in range(len(prefixes)):
if prefixes[i] in line[0]:
prefixescount[i]+=1
line=line[1:]
for i in range(len(opcodes)):
if any(opcodes[i]==li for li in line):
features.append(opcodes[i])
opcodescount[i]+=1
for i in range(len(registers)):
for li in line:
if registers[i] in li and ('text' in l or 'CODE' in l):
registerscount[i]+=1
for i in range(len(keywords)):
for li in line:
if keywords[i] in li:
keywordcount[i]+=1
for prefix in prefixescount:
file1.write(str(prefix)+",")
for opcode in opcodescount:
file1.write(str(opcode)+",")
for register in registerscount:
file1.write(str(register)+",")
for key in keywordcount:
file1.write(str(key)+",")
file1.write("\n")
file1.close()
def fourthprocess():
prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
keywords = ['.dll','std::',':dword']
registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
file1=open("output\hugeasmfile.txt","w+")
files = os.listdir('fourth/')
for f in files:
prefixescount=np.zeros(len(prefixes),dtype=int)
opcodescount=np.zeros(len(opcodes),dtype=int)
keywordcount=np.zeros(len(keywords),dtype=int)
registerscount=np.zeros(len(registers),dtype=int)
features=[]
f2=f.split('.')[0]
file1.write(f2+",")
# opcodefile.write(f2+" ")  # 'opcodefile' is not defined in this notebook, so this line is disabled
with codecs.open('fourth/'+f,encoding='cp1252',errors ='replace') as fli:
for lines in fli:
line=lines.rstrip().split()
l=line[0]
for i in range(len(prefixes)):
if prefixes[i] in line[0]:
prefixescount[i]+=1
line=line[1:]
for i in range(len(opcodes)):
if any(opcodes[i]==li for li in line):
features.append(opcodes[i])
opcodescount[i]+=1
for i in range(len(registers)):
for li in line:
if registers[i] in li and ('text' in l or 'CODE' in l):
registerscount[i]+=1
for i in range(len(keywords)):
for li in line:
if keywords[i] in li:
keywordcount[i]+=1
for prefix in prefixescount:
file1.write(str(prefix)+",")
for opcode in opcodescount:
file1.write(str(opcode)+",")
for register in registerscount:
file1.write(str(register)+",")
for key in keywordcount:
file1.write(str(key)+",")
file1.write("\n")
file1.close()
def fifthprocess():
prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
keywords = ['.dll','std::',':dword']
registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
file1=open("output\trainasmfile.txt","w+")
files = os.listdir('fifth/')
for f in files:
prefixescount=np.zeros(len(prefixes),dtype=int)
opcodescount=np.zeros(len(opcodes),dtype=int)
keywordcount=np.zeros(len(keywords),dtype=int)
registerscount=np.zeros(len(registers),dtype=int)
features=[]
f2=f.split('.')[0]
file1.write(f2+",")
# opcodefile.write(f2+" ")  # 'opcodefile' is not defined in this notebook, so this line is disabled
with codecs.open('fifth/'+f,encoding='cp1252',errors ='replace') as fli:
for lines in fli:
line=lines.rstrip().split()
l=line[0]
for i in range(len(prefixes)):
if prefixes[i] in line[0]:
prefixescount[i]+=1
line=line[1:]
for i in range(len(opcodes)):
if any(opcodes[i]==li for li in line):
features.append(opcodes[i])
opcodescount[i]+=1
for i in range(len(registers)):
for li in line:
if registers[i] in li and ('text' in l or 'CODE' in l):
registerscount[i]+=1
for i in range(len(keywords)):
for li in line:
if keywords[i] in li:
keywordcount[i]+=1
for prefix in prefixescount:
file1.write(str(prefix)+",")
for opcode in opcodescount:
file1.write(str(opcode)+",")
for register in registerscount:
file1.write(str(register)+",")
for key in keywordcount:
file1.write(str(key)+",")
file1.write("\n")
file1.close()
def main():
#the below code is used for multiprocessing
#the number of processes depends on the number of cores present in the system
#Process is used to spawn each worker process
manager=multiprocessing.Manager()
p1=Process(target=firstprocess)
p2=Process(target=secondprocess)
p3=Process(target=thirdprocess)
p4=Process(target=fourthprocess)
p5=Process(target=fifthprocess)
#p1.start() is used to start the process execution
p1.start()
p2.start()
p3.start()
p4.start()
p5.start()
#After completion all the processes are joined
p1.join()
p2.join()
p3.join()
p4.join()
p5.join()
if __name__=="__main__":
main()
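The five worker functions each write their own comma-separated text file under output/, while the notebook then reads a single asmoutputfile.csv, so a combining step is implied but not shown. Here is a hedged sketch of how those pieces could be stitched together, with the header assembled from the same four keyword lists (and the same column order) used by the workers:
import glob
import pandas as pd

prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
opcodes = ['jmp','mov','retf','push','pop','xor','retn','nop','sub','inc','dec','add','imul','xchg','or','shr','cmp','call','shl','ror','rol','jnb','jz','rtn','lea','movzx']
keywords = ['.dll','std::',':dword']
registers = ['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
columns = ['ID'] + prefixes + opcodes + registers + keywords   # same order the workers write in

parts = []
for path in sorted(glob.glob('output/*.txt')):
    part = pd.read_csv(path, header=None)
    part = part.iloc[:, :len(columns)]      # drop the empty field left by each row's trailing comma
    part.columns = columns
    parts.append(part)
pd.concat(parts, ignore_index=True).to_csv('asmoutputfile.csv', index=False)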
# asmoutputfile.csv (generated by combining the per-folder outputs of the above cells, as in the sketch just above) contains all the extracted features from the .asm files
# we will use this file directly
dfasm=pd.read_csv("asmoutputfile.csv")
Y.columns = ['ID', 'Class']
result_asm = pd.merge(dfasm, Y,on='ID', how='left')
result_asm.head()
# file sizes of asm files
files=os.listdir('asmFiles')
filenames=Y['ID'].tolist()
class_y=Y['Class'].tolist()
class_bytes=[]
sizebytes=[]
fnames=[]
for file in files:
# print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
# os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0,
# st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
# read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
statinfo=os.stat('asmFiles/'+file)
# split the file name at '.' and take the first part of it i.e the file name
file=file.split('.')[0]
if any(file == filename for filename in filenames):
i=filenames.index(file)
class_bytes.append(class_y[i])
# converting into Mb's
sizebytes.append(statinfo.st_size/(1024.0*1024.0))
fnames.append(file)
asm_size_byte=pd.DataFrame({'ID':fnames,'size':sizebytes,'Class':class_bytes})
print (asm_size_byte.head())
#boxplot of asm files
ax = sns.boxplot(x="Class", y="size", data=asm_size_byte)
plt.title("boxplot of .bytes file sizes")
plt.show()
# add the file size feature to previous extracted features
print(result_asm.shape)
print(asm_size_byte.shape)
result_asm = pd.merge(result_asm, asm_size_byte.drop(['Class'], axis=1),on='ID', how='left')
result_asm.head()
# we normalize the data each column
result_asm = normalize(result_asm)
result_asm.head()
ax = sns.boxplot(x="Class", y=".text:", data=result_asm)
plt.title("boxplot of .asm text segment")
plt.show()
The plot is between the .text segment counts and the class label. Classes 1, 2 and 9 can be easily separated.
ax = sns.boxplot(x="Class", y=".Pav:", data=result_asm)
plt.title("boxplot of .asm pav segment")
plt.show()
ax = sns.boxplot(x="Class", y=".data:", data=result_asm)
plt.title("boxplot of .asm data segment")
plt.show()
The plot is between the .data segment counts and the class label. Class 6 and class 9 can be easily separated from the given points.
ax = sns.boxplot(x="Class", y=".bss:", data=result_asm)
plt.title("boxplot of .asm bss segment")
plt.show()
Plot between the .bss segment counts and the class label. Very few files have a .bss segment.
ax = sns.boxplot(x="Class", y=".rdata:", data=result_asm)
plt.title("boxplot of .asm rdata segment")
plt.show()
Plot between the .rdata segment counts and the class label. Class 2 can be easily separated; its 75th-percentile files have about 1M rdata lines.
ax = sns.boxplot(x="Class", y="jmp", data=result_asm)
plt.title("boxplot of .asm jmp opcode")
plt.show()
Plot between the jmp opcode counts and the class label. Class 1 has a jmp frequency of approximately 2000 at the 75th percentile of files.
ax = sns.boxplot(x="Class", y="mov", data=result_asm)
plt.title("boxplot of .asm mov opcode")
plt.show()
Plot between the class label and the mov opcode counts. Class 1 has a mov frequency of approximately 2000 at the 75th percentile of files.
ax = sns.boxplot(x="Class", y="retf", data=result_asm)
plt.title("boxplot of .asm retf opcode")
plt.show()
Plot between the class label and the retf opcode counts. Class 6 can be easily separated with the retf opcode; its retf frequency is approximately 250.
ax = sns.boxplot(x="Class", y="push", data=result_asm)
plt.title("boxplot of .asm push opcode")
plt.show()
Plot between the push opcode counts and the class label. Class 1's 75th-percentile files have a push opcode frequency of around 1000.
#multivariate analysis on asm files
#this is with perplexity 50
xtsne=TSNE(perplexity=50)
results=xtsne.fit_transform(result_asm.drop(['ID','Class'], axis=1).fillna(0))
vis_x = results[:, 0]
vis_y = results[:, 1 ]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
# by univariate analysis of the .asm file features we get very negligible information from the
# 'rtn', '.BSS:' and '.CODE' features, so here we try multivariate analysis after removing those features
# the plot looks very messy
xtsne=TSNE(perplexity=30)
results=xtsne.fit_transform(result_asm.drop(['ID','Class', 'rtn', '.BSS:', '.CODE','size'], axis=1))
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
t-SNE plots for the asm data, with perplexity 50 (all features) and perplexity 30 (after dropping the low-information features)
asm_y = result_asm['Class']
asm_x = result_asm.drop(['ID','Class','.BSS:','rtn','.CODE'], axis=1)
X_train_asm, X_test_asm, y_train_asm, y_test_asm = train_test_split(asm_x,asm_y ,stratify=asm_y,test_size=0.20)
X_train_asm, X_cv_asm, y_train_asm, y_cv_asm = train_test_split(X_train_asm, y_train_asm,stratify=y_train_asm,test_size=0.20)
print( X_cv_asm.isnull().all())
# find more about KNeighborsClassifier() here http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
# -------------------------
# default parameter
# KNeighborsClassifier(n_neighbors=5, weights=’uniform’, algorithm=’auto’, leaf_size=30, p=2,
# metric=’minkowski’, metric_params=None, n_jobs=1, **kwargs)
# methods of
# fit(X, y) : Fit the model using X as training data and y as target values
# predict(X):Predict the class labels for the provided data
# predict_proba(X):Return probability estimates for the test data X.
# find more about CalibratedClassifierCV here at http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
# ----------------------------
# default paramters
# sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method=’sigmoid’, cv=3)
#
# some of the methods of CalibratedClassifierCV()
# fit(X, y[, sample_weight]) Fit the calibrated model
# get_params([deep]) Get parameters for this estimator.
# predict(X) Predict the target of new samples.
# predict_proba(X) Posterior probabilities of classification
alpha = [x for x in range(1, 21,2)]
cv_log_error_array=[]
for i in alpha:
k_clf=KNeighborsClassifier(n_neighbors=i)
k_clf.fit(X_train_asm,y_train_asm)
sig_clf = CalibratedClassifierCV(k_clf, method="sigmoid")
sig_clf.fit(X_train_asm, y_train_asm)
predict_y = sig_clf.predict_proba(X_cv_asm)
cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=k_clf.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for k = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
k_clf=KNeighborsClassifier(n_neighbors=alpha[best_alpha])
k_clf.fit(X_train_asm,y_train_asm)
sig_clf = CalibratedClassifierCV(k_clf, method="sigmoid")
sig_clf.fit(X_train_asm, y_train_asm)
pred_y=sig_clf.predict(X_test_asm)
predict_y = sig_clf.predict_proba(X_train_asm)
print ('log loss for train data',log_loss(y_train_asm, predict_y))
predict_y = sig_clf.predict_proba(X_cv_asm)
print ('log loss for cv data',log_loss(y_cv_asm, predict_y))
predict_y = sig_clf.predict_proba(X_test_asm)
print ('log loss for test data',log_loss(y_test_asm, predict_y))
plot_confusion_matrix(y_test_asm,sig_clf.predict(X_test_asm))
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# ------------------------------
# default parameters
# SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate=’optimal’, eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)
# some of methods
# fit(X, y[, coef_init, intercept_init, …]) Fit linear model with Stochastic Gradient Descent.
# predict(X) Predict class labels for samples in X.
alpha = [10 ** x for x in range(-5, 4)]
cv_log_error_array=[]
for i in alpha:
logisticR=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
logisticR.fit(X_train_asm,y_train_asm)
sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
sig_clf.fit(X_train_asm, y_train_asm)
predict_y = sig_clf.predict_proba(X_cv_asm)
cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=logisticR.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
logisticR=LogisticRegression(penalty='l2',C=alpha[best_alpha],class_weight='balanced')
logisticR.fit(X_train_asm,y_train_asm)
sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
sig_clf.fit(X_train_asm, y_train_asm)
predict_y = sig_clf.predict_proba(X_train_asm)
print ('log loss for train data',(log_loss(y_train_asm, predict_y, labels=logisticR.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_cv_asm)
print ('log loss for cv data',(log_loss(y_cv_asm, predict_y, labels=logisticR.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_test_asm)
print ('log loss for test data',(log_loss(y_test_asm, predict_y, labels=logisticR.classes_, eps=1e-15)))
plot_confusion_matrix(y_test_asm,sig_clf.predict(X_test_asm))
# --------------------------------
# default parameters
# sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion=’gini’, max_depth=None, min_samples_split=2,
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=’auto’, max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False,
# class_weight=None)
# Some of methods of RandomForestClassifier()
# fit(X, y[, sample_weight]) Build a forest of trees from the training set (X, y).
# predict(X) Predict class for X.
# predict_proba(X) Predict class probabilities for X.
# some of attributes of RandomForestClassifier()
# feature_importances_ : array of shape = [n_features]
# The feature importances (the higher, the more important the feature).
alpha=[10,50,100,500,1000,2000,3000]
cv_log_error_array=[]
for i in alpha:
r_clf=RandomForestClassifier(n_estimators=i,random_state=42,n_jobs=-1)
r_clf.fit(X_train_asm,y_train_asm)
sig_clf = CalibratedClassifierCV(r_clf, method="sigmoid")
sig_clf.fit(X_train_asm, y_train_asm)
predict_y = sig_clf.predict_proba(X_cv_asm)
cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=r_clf.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
r_clf=RandomForestClassifier(n_estimators=alpha[best_alpha],random_state=42,n_jobs=-1)
r_clf.fit(X_train_asm,y_train_asm)
sig_clf = CalibratedClassifierCV(r_clf, method="sigmoid")
sig_clf.fit(X_train_asm, y_train_asm)
predict_y = sig_clf.predict_proba(X_train_asm)
print ('log loss for train data',(log_loss(y_train_asm, predict_y, labels=sig_clf.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_cv_asm)
print ('log loss for cv data',(log_loss(y_cv_asm, predict_y, labels=sig_clf.classes_, eps=1e-15)))
predict_y = sig_clf.predict_proba(X_test_asm)
print ('log loss for test data',(log_loss(y_test_asm, predict_y, labels=sig_clf.classes_, eps=1e-15)))
plot_confusion_matrix(y_test_asm,sig_clf.predict(X_test_asm))
# Training a hyper-parameter tuned XGBoost classifier on our train data
# find more about XGBClassifier here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default parameters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
# some of the methods of XGBClassifier()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep]) Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_booster().get_score(importance_type='weight') -> get the feature importance
alpha=[10,50,100,500,1000,2000,3000]
cv_log_error_array=[]
for i in alpha:
x_clf=XGBClassifier(n_estimators=i,nthread=-1)
x_clf.fit(X_train_asm,y_train_asm)
sig_clf = CalibratedClassifierCV(x_clf, method="sigmoid")
sig_clf.fit(X_train_asm, y_train_asm)
predict_y = sig_clf.predict_proba(X_cv_asm)
cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=x_clf.classes_, eps=1e-15))
for i in range(len(cv_log_error_array)):
print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
best_alpha = np.argmin(cv_log_error_array)
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
x_clf=XGBClassifier(n_estimators=alpha[best_alpha],nthread=-1)
x_clf.fit(X_train_asm,y_train_asm)
sig_clf = CalibratedClassifierCV(x_clf, method="sigmoid")
sig_clf.fit(X_train_asm, y_train_asm)
predict_y = sig_clf.predict_proba(X_train_asm)
print ('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train_asm, predict_y))
predict_y = sig_clf.predict_proba(X_cv_asm)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv_asm, predict_y))
predict_y = sig_clf.predict_proba(X_test_asm)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test_asm, predict_y))
plot_confusion_matrix(y_test_asm,sig_clf.predict(X_test_asm))
x_clf=XGBClassifier()
prams={
'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
'n_estimators':[100,200,500,1000,2000],
'max_depth':[3,5,10],
'colsample_bytree':[0.1,0.3,0.5,1],
'subsample':[0.1,0.3,0.5,1]
}
random_cfl=RandomizedSearchCV(x_clf,param_distributions=prams,verbose=10,n_jobs=-1,)
random_cfl.fit(X_train_asm,y_train_asm)
print (random_cfl.best_params_)
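The next cell hard-codes the tuned values (presumably the ones printed above); as a side note, they could equally be pulled straight from the fitted search object. A minimal sketch, assuming the search above has finished:
best = random_cfl.best_params_
x_clf = XGBClassifier(n_estimators=best['n_estimators'], max_depth=best['max_depth'], learning_rate=best['learning_rate'], colsample_bytree=best['colsample_bytree'], subsample=best['subsample'])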
# Training XGBoost with the best hyper-parameters found by the RandomizedSearchCV above
# find more about XGBClassifier here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
x_clf=XGBClassifier(n_estimators=200,subsample=0.5,learning_rate=0.15,colsample_bytree=0.5,max_depth=3)
x_clf.fit(X_train_asm,y_train_asm)
c_cfl=CalibratedClassifierCV(x_clf,method='sigmoid')
c_cfl.fit(X_train_asm,y_train_asm)
predict_y = c_cfl.predict_proba(X_train_asm)
print ('train loss',log_loss(y_train_asm, predict_y))
predict_y = c_cfl.predict_proba(X_cv_asm)
print ('cv loss',log_loss(y_cv_asm, predict_y))
predict_y = c_cfl.predict_proba(X_test_asm)
print ('test loss',log_loss(y_test_asm, predict_y))
# separating byte files and asm files
# I am doing slight re-arrangement of the files for this FINAL run of Featurization and model training
from google.colab import drive
drive.mount('/content/gdrive')
root_path = '/content/gdrive/MyDrive/Malware/Full_data/'
# root_path = '../../LARGE_Datasets/'
destination_1 = root_path+'byteFiles'
destination_2 = root_path+'asmFiles'
This cell's code is what we have already run earlier in the experimentation part; it is included below again for the sake of completeness.
%%time
# This cell's code is what we have already ran earlier in the experimentation part
# Including here again for the sake of completeness
# removal of address from byte files
# contents of .byte files
# ----------------
#00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08
#-------------------
#we remove the starting address 00401000
files = os.listdir(root_path+'byteFiles/')
filenames=[]
array=[]
for file in tqdm(files):
if(file.endswith("bytes")):
file=file.split('.')[0]
text_file = open(root_path+'byteFiles/'+file+".txt", 'w+')
with open(root_path+'byteFiles/' + file + '.bytes', 'r') as fp:
lines=""
for line in fp:
# rstrip()=> Return a copy of the string with trailing characters removed.
# Once we have removed trailing characters, invoke split() to return the list of string which are separated by ","
# split() specifies the separator to use when splitting the string. By default any whitespace is a separator
a=line.rstrip().split(" ")[1:] # [1:] is equivalent to "1 to end" as we are removing 0-th element of address from byte files
b=' '.join(a)
b=b+"\n" # Python doesn't automatically add line breaks, you need to do that manually
text_file.write(b)
fp.close()
os.remove(root_path+'byteFiles/'+file+".bytes")
text_file.close()
files = os.listdir(root_path+'byteFiles/')
filenames2=[]
feature_matrix = np.zeros((len(files),257),dtype=int)
k=0
# program to convert into bag of words of bytefiles
# this is a custom-built, unigram bag of words
# This is a custom implementation of CountVectorizer, as CountVectorizer will NOT support working on such a huge set of files (~50GB)
# Creating this uni-gram feature and writing it to a file named 'result.csv'
byte_feature_file=open(root_path + 'result.csv','w+')
byte_feature_file.write("ID,0,1,2,3,4,5,6,7,8,9,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20,21,22,23,24,25,26,27,28,29,2a,2b,2c,2d,2e,2f,30,31,32,33,34,35,36,37,38,39,3a,3b,3c,3d,3e,3f,40,41,42,43,44,45,46,47,48,49,4a,4b,4c,4d,4e,4f,50,51,52,53,54,55,56,57,58,59,5a,5b,5c,5d,5e,5f,60,61,62,63,64,65,66,67,68,69,6a,6b,6c,6d,6e,6f,70,71,72,73,74,75,76,77,78,79,7a,7b,7c,7d,7e,7f,80,81,82,83,84,85,86,87,88,89,8a,8b,8c,8d,8e,8f,90,91,92,93,94,95,96,97,98,99,9a,9b,9c,9d,9e,9f,a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae,af,b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,ba,bb,bc,bd,be,bf,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,ca,cb,cc,cd,ce,cf,d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,da,db,dc,dd,de,df,e0,e1,e2,e3,e4,e5,e6,e7,e8,e9,ea,eb,ec,ed,ee,ef,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,fa,fb,fc,fd,fe,ff,??")
byte_feature_file.write("\n")
for file in tqdm(files):
filenames2.append(file)
byte_feature_file.write(file+",")
if(file.endswith("txt")):
with open(root_path+'byteFiles/'+file,"r") as byte_file:
for lines in byte_file:
line=lines.rstrip().split(" ")
for hex_code in line:
if hex_code=='??':
feature_matrix[k][256]+=1
else:
feature_matrix[k][int(hex_code,16)]+=1
byte_file.close()
for i, row in enumerate(feature_matrix[k]):
if i!=len(feature_matrix[k])-1:
byte_feature_file.write(str(row)+",")
else:
byte_feature_file.write(str(row))
byte_feature_file.write("\n")
k += 1
byte_feature_file.close()
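For intuition, the counting done by the loop above is just a unigram bag-of-words over the hex tokens of each file; a minimal, standalone sketch on the single example line shown earlier (illustration only, not part of the pipeline):
from collections import Counter
sample_line = "56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08" # one byte-file line with the address already stripped
counts = Counter(sample_line.split())
print(counts['00']) # 2 -> the same value the loop above adds at column index int('00', 16) = 0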
%%time
uni_gram_byte_features = pd.read_csv(root_path + "result.csv")
uni_gram_byte_features['ID'] = uni_gram_byte_features['ID'].str.split('.').str[0]
print('Unigram byte_features shape ', uni_gram_byte_features.shape)
uni_gram_byte_features.head(2)
%%time
# This cell's code is what we have already ran earlier in the experimentation part
# Including here again for the sake of completeness
Y=pd.read_csv(root_path + "trainLabels.csv")
files=os.listdir(root_path + 'byteFiles')
filenames=Y['Id'].tolist()
class_y=Y['Class'].tolist()
class_bytes=[]
sizebytes=[]
fnames=[]
for file in tqdm(files):
# print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
# os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0,
# st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
# read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
statinfo=os.stat(root_path+'byteFiles/'+file)
# split the file name at '.' and take the first part of it, i.e. the file name
file=file.split('.')[0]
if any(file == filename for filename in filenames):
i=filenames.index(file)
class_bytes.append(class_y[i])
# converting into Mb's
sizebytes.append(statinfo.st_size/(1024.0*1024.0))
fnames.append(file)
byte_feature_size=pd.DataFrame({'ID':fnames, 'size':sizebytes,'Class':class_bytes})
print (byte_feature_size.head())
if not os.path.isdir(root_path + "featurization"):
os.makedirs(root_path + "featurization")
if not os.path.isdir(root_path + "featurization/featurization_final"):
os.mkdir(root_path + "featurization/featurization_final")
# Creating and writing to a file named "class_labels.pkl" to get the class labels and IDs from the byte unigrams dataframe and save them for later use
class_labels=byte_feature_size["Class"]
with open(root_path+'featurization/class_labels.pkl', 'wb') as file:
pkl.dump(class_labels, file)
'''
https://www.datacamp.com/community/tutorials/pickle-python-tutorial
To open the file for writing, simply use the open() function. The first argument should be the name of your file. The second argument is 'wb'. The w means that you'll be writing to the file, and b refers to binary mode. This means that the data will be written in the form of byte objects.
'''
# Load the class labels for training with the random forest feature selector
with open(root_path+'featurization/class_labels.pkl', 'rb') as file:
class_labels=pkl.load(file)
N-grams of texts are extensively used in text mining and natural language processing tasks.
The main concept is this: words are basic, meaningful elements, but they can take on a different meaning when combined in a sentence. So we keep in mind that word groups sometimes convey the meaning better than single words. Here is our sentence: "I read a book about the history of America."
The machine wants to get the meaning of the sentence by separating it into small pieces. How should it do that?
It can regard words one by one. This is unigram; each word is a gram. "I", "read", "a", "book", "about", "the", "history", "of", "America"
It can regard words two at a time. This is bigram (digram); each two adjacent words create a bigram. "I read", "read a", "a book", "book about", "about the", "the history", "history of", "of America"
It can regard words three at a time. This is trigram; each three adjacent words create a trigram. "I read a", "read a book", "a book about", "book about the", "about the history", "the history of", "history of America"
So, an n-gram is a contiguous sequence of n items from a given sample of text or speech. An n-gram of size 1 is referred to as a "unigram", size 2 is a "bigram", and size 3 is a "trigram". When n > 3 this is usually referred to as four-grams, five-grams and so on.
Formula to calculate number of N-grams in a sentence.
If X=Number of words in a given sentence, the number of n-grams for that sentence would be:
Ngram = X - (N - 1)
Example:
Sentence : I want to learn Machine Learning
Unigram: now calculate the number of unigrams in the sentence using the formula
here, X = 6 and N = 1 (for unigram)
Ngram = X - (N - 1)
Ngram = 6 - (1 - 1) = 6 (i.e. the number of unigrams equals the number of words in the sentence)
Bigram:
here, X = 6 and N = 2 (for bigram)
Ngram = X - (N - 1)
Ngram = 6 - (2 - 1) = 5
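A quick way to sanity-check this formula is with scikit-learn's CountVectorizer (illustrative only; since every n-gram in this sentence is distinct, the vocabulary size equals the count given by the formula):
from sklearn.feature_extraction.text import CountVectorizer
sentence = ["I want to learn Machine Learning"]
unigram_vec = CountVectorizer(tokenizer=lambda x: x.split(), lowercase=False, ngram_range=(1, 1))
bigram_vec = CountVectorizer(tokenizer=lambda x: x.split(), lowercase=False, ngram_range=(2, 2))
print(len(unigram_vec.fit(sentence).get_feature_names())) # 6 = X - (1 - 1)
print(len(bigram_vec.fit(sentence).get_feature_names())) # 5 = X - (2 - 1)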
%%time
uni_gram_byte_features__with_size = uni_gram_byte_features.merge(byte_feature_size, on="ID")
uni_gram_byte_features__with_size.to_csv(root_path + "featurization/uni_gram_byte_features__with_size.csv", index=False)
uni_gram_byte_features__with_size = normalize(uni_gram_byte_features__with_size)
%%time
from sklearn.feature_extraction.text import CountVectorizer
bigram_tokens="00,01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20,21,22,23,24,25,26,27,28,29,2a,\
2b,2c,2d,2e,2f,30,31,32,33,34,35,36,37,38,39,3a,3b,3c,3d,3e,3f,40,41,42,43,44,45,46,47,48,49,4a,4b,4c,4d,4e,4f,50,51,52,53,54,55,56,57,58,\
59,5a,5b,5c,5d,5e,5f,60,61,62,63,64,65,66,67,68,69,6a,6b,6c,6d,6e,6f,70,71,72,73,74,75,76,77,78,79,7a,7b,7c,7d,7e,7f,80,81,82,83,84,85,86,\
87,88,89,8a,8b,8c,8d,8e,8f,90,91,92,93,94,95,96,97,98,99,9a,9b,9c,9d,9e,9f,a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae,af,b0,b1,b2,b3,b4,b5,\
b6,b7,b8,b9,ba,bb,bc,bd,be,bf,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,ca,cb,cc,cd,ce,cf,d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,da,db,dc,dd,de,df,e0,e1,e2,e3,e4,\
e5,e6,e7,e8,e9,ea,eb,ec,ed,ee,ef,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,fa,fb,fc,fd,fe,ff,??"
bigram_tokens=bigram_tokens.split(",")
# Between 00 and FF there are 256 unique values; together with the '??' token, and taking each pair of
# hexadecimal values as one "word", we are dealing with 257 unique tokens.
# Hence the function below will build all the possible bigram combinations of these tokens
def calculate_bigram(bigram_tokens):
sentence=""
vocabulary_list_for_byte_bigrams=[]
for i in tqdm(range(len(bigram_tokens))):
for j in range(len(bigram_tokens)):
bigram=bigram_tokens[i]+" "+bigram_tokens[j]
sentence=sentence+bigram+","
vocabulary_list_for_byte_bigrams.append(bigram)
return vocabulary_list_for_byte_bigrams
vocabulary_list_for_byte_bigrams = calculate_bigram(bigram_tokens)
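Quick sanity check on the vocabulary just built (illustrative only): 256 hex values plus the '??' token give 257 tokens, so the bigram vocabulary should contain 257 * 257 = 66,049 entries.
print(len(bigram_tokens), len(vocabulary_list_for_byte_bigrams)) # 257 66049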
%%time
import scipy
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(),lowercase=False, ngram_range=(2,2),vocabulary=vocabulary_list_for_byte_bigrams)
# For Explanations on "tokenizer=lambda x: x.split()"
# Refer - https://stackoverflow.com/a/37884104/1902852
# Without this "??" was not getting vectorized properly
file_list_byte_files=os.listdir(root_path + 'byteFiles')
features=["ID"]+vectorizer.get_feature_names()
byte_file_bigram_df=pd.DataFrame(columns=features)
# Creating "featurization/byte_files_bigram_df.csv" and writng to it the full bi-gram data frame
with open(root_path + "featurization/byte_files_bigram_df.csv", mode='w') as byte_file_bigram_df:
byte_file_bigram_df.write(','.join(map(str, features)))
byte_file_bigram_df.write('\n')
for _, file in tqdm(enumerate(file_list_byte_files)):
file_id=file.split(".")[0] #ID of each file
file = open(root_path + 'byteFiles/' + file)
corpus_byte_codes=[file.read().replace('\n', ' ').lower()] # corpus_byte_codes holds all the byte codes for a given file
bigrams_counts = vectorizer.transform(corpus_byte_codes) # Returning a sparse vector containing all the bigram counts from the corpus_byte_codes
# Update each row of our dataframe with the bigram counts of the respective file
row = scipy.sparse.csr_matrix(bigrams_counts).toarray()
# Write a single row in the CSV file
byte_file_bigram_df.write(','.join(map(str, [file_id]+list(row[0]))))
byte_file_bigram_df.write('\n')
file.close()
%%time
# Load the byte_files_bigram_df.csv file (the non-normalized bigram-count dataset for the byte files)
# that we created in the previous cell
X_byte_bigram_all_df = pd.read_csv(root_path + "featurization/byte_files_bigram_df.csv")
X_byte_bigram_all_df.head(2)
%%time
from sklearn.feature_selection import SelectKBest, chi2, f_regression
select_kbest_object = SelectKBest(score_func=chi2, k=2000)
# SelectKBest scores the features using a function, which is chi2 here
# Then "removes all but the k highest scoring features"
# Need to remove "ID" column, else will get below error
# "SelectKBest fit: ValueError: could not convert string to float"
most_imp_features_byte_bigram = select_kbest_object.fit(X_byte_bigram_all_df.drop("ID", axis=1), class_labels)
# most_imp_features_byte_bigram.scores_ => gives an array of form
# array([9.79531407e+05, 4.26642398e+04, 1.78812060e+04, ..., 4.33426736e+07])
# So now creating a df from this array
most_imp_byte_bigram_feature_score_df = pd.DataFrame(most_imp_features_byte_bigram.scores_)
# Creating a df from all the column names from the original full X_byte_bigram_all_df df
most_imp_byte_bigram_columns_df = pd.DataFrame(X_byte_bigram_all_df.columns)
# Concat the feature scores along with the feature names in a byte_bigram_df_important_feature_score,
# From this we will get all feature names later, to be matched against X_byte_bigram_all_df - to extract ONLY the best features from the bigrams df data
byte_bigram_df_important_feature_score = pd.concat([most_imp_byte_bigram_columns_df, most_imp_byte_bigram_feature_score_df],axis=1)
byte_bigram_df_important_feature_score.columns = ["Byte Bigram Top 2000 Feature Names","Byte Bigram Top 2000 Feature Score"]
# Find the top 2000 features along with their scores
# nlargest() returns the first 2000 rows with the largest values in the specified column ("Byte Bigram Top 2000 Feature Score")
# in descending order. The columns that are not specified are returned as well, but not used for ordering.
byte_bigram_df_important_feature_score = byte_bigram_df_important_feature_score.nlargest(2000, "Byte Bigram Top 2000 Feature Score")
# Let's look at the top few features along with their scores
byte_bigram_df_important_feature_score.head(2)
# Getting the list of first 2000 feature names
top_2000_most_imp_byte_bigram_feature_names = list(byte_bigram_df_important_feature_score["Byte Bigram Top 2000 Feature Names"])
# top_2000_byte_bigram_features = dd.concat([X_byte_bigram_all_df["ID"], X_byte_bigram_all_df[top_2000_most_imp_byte_bigram]], axis=1)
top_2000_byte_bigram_features = pd.concat([X_byte_bigram_all_df["ID"], X_byte_bigram_all_df[top_2000_most_imp_byte_bigram_feature_names]], axis=1)
top_2000_byte_bigram_features.to_csv(root_path + "featurization/featurization_final/top_2000_imp_byte_bigram_df.csv",index=None)
print(top_2000_byte_bigram_features.shape)
top_2000_byte_bigram_features.head(2)
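As a side note, the same set of top column names can also be pulled directly from the fitted selector with get_support(), without building the intermediate score dataframe; a small sketch using the objects defined above:
mask = most_imp_features_byte_bigram.get_support() # boolean mask over the columns passed to fit()
cols_without_id = X_byte_bigram_all_df.drop("ID", axis=1).columns
print(len(cols_without_id[mask])) # 2000, i.e. the k passed to SelectKBest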
There are 10,868 asm files, which make up about 150 GB.
The asm files contain addresses, segments, opcodes, registers, function calls and APIs.
Earlier we already extracted these features with the help of parallel processing; in total we extracted the 52 important features from all the asm files.
# First read the file that was generated by the code above
# i.e. the code that ran for around 48 hours, as mentioned earlier.
dfasm=pd.read_csv(root_path + "asmoutputfile.csv")
Y.columns = ['ID', 'Class']
# Note, Y holds all the train labels (the Class column), which has been defined earlier as below
# Y = pd.read_csv(root_path + "trainLabels.csv")
unigram_asm = pd.merge(dfasm, Y, on='ID', how='left')
unigram_asm = normalize(unigram_asm)
unigram_asm.head()
This cell's code is what we have already run earlier in the experimentation part; it is included below again for the sake of completeness.
# file sizes of asm files
# This code is very much similar to what has been used to extract sizes of
# byte files earlier.
files=os.listdir(root_path + 'asmFiles')
filenames=Y['ID'].tolist()
class_y=Y['Class'].tolist()
class_bytes=[]
sizebytes=[]
fnames=[]
for file in tqdm(files):
# print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
# os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0,
# st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
# read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
statinfo=os.stat(root_path + 'asmFiles/'+file)
# split the file name at '.' and take the first part of it, i.e. the file name
file=file.split('.')[0]
if any(file == filename for filename in filenames):
i=filenames.index(file)
class_bytes.append(class_y[i])
# converting into Mb's
sizebytes.append(statinfo.st_size/(1024.0*1024.0))
fnames.append(file)
asm_file_size=pd.DataFrame({'ID':fnames,'size':sizebytes,'Class':class_bytes})
# asm_file_size.to_csv(root_path + "featurization/asm_file_size.csv", index=False)
asm_file_size.head()
unigram_asm_feature__with_size=pd.merge(asm_file_size, unigram_asm.drop(columns=["Class"]),on='ID', how='left')
unigram_asm_feature__with_size.to_csv(root_path + "featurization/unigram_asm_feature__with_size")
unigram_asm_feature__with_size.head()
%%time
import numpy as np
import os
import codecs
import imageio
import array
from datetime import datetime as dt
if not os.path.isdir(root_path + "image_file_asm"):
os.mkdir(root_path + "image_file_asm")
asmfile_list=os.listdir(root_path + "asmFiles/")
# Function to extract images from ASM files and save them to a specified folder (the second arg to the func)
def extract_images_from_text(arr_of_filenames, folder_to_save_generated_images):
for file_name in tqdm(arr_of_filenames):
if(file_name.endswith("asm")):
this_file = codecs.open(root_path + "asmFiles/" + file_name, 'rb')
size_of_current_asm_file = os.path.getsize(root_path + "asmFiles/"+file_name)
width_of_file = int(size_of_current_asm_file**0.5)
remainder = size_of_current_asm_file % width_of_file
# To create array of single bytes, passing type code 'B'
# "B" is for unsigned characters
array_of_image = array.array('B')
array_of_image.fromfile(this_file, size_of_current_asm_file-remainder)
this_file.close()
arr_of_generated_image = np.reshape(array_of_image[:width_of_file * width_of_file], (width_of_file, width_of_file))
arr_of_generated_image = np.uint8(arr_of_generated_image)
imageio.imwrite(folder_to_save_generated_images+'/' + file_name.split(".")[0] + '.png', arr_of_generated_image)
# Now invoke the above function
directory_to_save_generated_image = root_path + 'image_file_asm'
extract_images_from_text(asmfile_list, directory_to_save_generated_image)
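For intuition on the width/remainder arithmetic inside the function above, here is a small worked example (the file size is just the illustrative number from the os.stat comment earlier; not part of the pipeline):
size_example = 3680109 # bytes
width_example = int(size_example ** 0.5) # 1918
remainder_example = size_example % width_example # 1385
print(width_example, remainder_example, width_example * width_example) # 1918 1385 3678724
# i.e. the trailing 1385 bytes are dropped and the rest is reshaped into a 1918 x 1918 grayscale image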
file_list_asm_files=os.listdir(root_path + 'image_file_asm/')
with open(root_path + "featurization/top_800_image_asm_df.csv", mode='w') as top_800_image_asm_df: #file_list_asm_files = 10868, top_800_image_asm_df=800
top_800_image_asm_df.write(','.join(map(str, ["ID"]+["pixel_asm{}".format(i) for i in range(800)])))
top_800_image_asm_df.write('\n')
for image in tqdm(file_list_asm_files):
file_id_asm_files=image.split(".")[0]
# Create a 2 Matrix to contain the image matrix in 2D format
asm_image_array=imageio.imread(root_path + "image_file_asm/"+image)
# Extracting the first 800 pixels from the flattened array
asm_image_array=asm_image_array.flatten()[:800]
top_800_image_asm_df.write(','.join(map(str, [file_id_asm_files]+list(asm_image_array))))
top_800_image_asm_df.write('\n')
%%time
top_800_image_asm_df=pd.read_csv(root_path + "featurization/top_800_image_asm_df.csv")
top_800_image_asm_df.head()
We know that the asm files contain assembly language code, which comprises keywords, opcodes, registers and APIs.
%%time
opcodes_for_bigram = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
# Converting list to dictionary for faster runtime
dict_asm_opcodes = dict(zip(opcodes_for_bigram, [1 for i in range(len(opcodes_for_bigram))]))
if not os.path.isdir(root_path + "opcodes_asm_files"):
os.mkdir(root_path + 'opcodes_asm_files')
'''
Noting first that the asm files contain:
1. Address
2. Segments
3. Opcodes
4. Registers
5. function calls
6. APIs
Calculating the opcode sequence for each asm file and saving it in the form of a text file, so that we can process the ASM files as plain text
(one text file per asm file, with that file's opcode sequence written on a single line).
Noting that in the asm files the opcodes are not placed side by side; instead there are a few other words between two opcodes, i.e. the opcodes occur at intervals.
So during extraction of the opcodes we need to preserve the sequence information,
e.g. which opcode precedes another opcode, or which opcode is followed by another opcode.
Based on this, a bigram data-matrix of vectors is derived containing the bigram sequence info for each file.
'''
def calculate_sequence_of_opcodes():
asm_file_names=os.listdir(root_path + 'asmFiles')
for this_asm_file in tqdm(asm_file_names):
each_asm_opcode_file = open(root_path + "opcodes_asm_files/{}_opcode_asm_bi_grams.txt".format(this_asm_file.split('.')[0]), "w+")
sequence_of_opcodes = ""
with codecs.open(root_path + 'asmFiles/' + this_asm_file, encoding='cp1252', errors ='replace') as asm_file:
for lines in asm_file:
line = lines.rstrip().split()
for word in line:
if dict_asm_opcodes.get(word)==1:
sequence_of_opcodes += word + ' '
each_asm_opcode_file.write(sequence_of_opcodes + "\n")
each_asm_opcode_file.close()
calculate_sequence_of_opcodes()
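To make the idea concrete, here is a tiny self-contained illustration (not part of the pipeline) of how a (2,2) CountVectorizer turns one extracted opcode sequence into bigram counts while preserving the ordering:
from sklearn.feature_extraction.text import CountVectorizer
demo_line = ["push mov push call"] # a made-up opcode sequence, as written to one of the text files above
demo_vec = CountVectorizer(tokenizer=lambda x: x.split(), lowercase=False, ngram_range=(2, 2))
print(demo_vec.fit_transform(demo_line).toarray()) # [[1 1 1]]
print(demo_vec.get_feature_names()) # ['mov push', 'push call', 'push mov']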
opcodes_asm__bigram_vocabulary = calculate_bigram(opcodes_for_bigram)
vectorizer_opcode = CountVectorizer(
tokenizer=lambda x: x.split(),
lowercase=False,
ngram_range=(2, 2),
vocabulary=opcodes_asm__bigram_vocabulary,
) # Noting, without "tokenizer=lambda x: x.split()", "??" would not get vectorized correctly
file_list_opcode = os.listdir(root_path + "opcodes_asm_files")
opcode_features = ["ID"] + vectorizer_opcode.get_feature_names()
opcodes_asm_bigram_df = pd.DataFrame(columns=opcode_features)
with open(
root_path + "featurization/opcodes_asm_bigram_df.csv", mode="w"
) as opcodes_asm_bigram_df:
opcodes_asm_bigram_df.write(",".join(map(str, opcode_features)))
opcodes_asm_bigram_df.write("\n")
for _, this_asm_file in tqdm(enumerate(file_list_opcode)):
this_file_id = this_asm_file.split("_")[0] # ID of each this_asm_file
this_asm_file = open(root_path + "opcodes_asm_files/" + this_asm_file)
corpus_opcodes_from_this_asm_file = [
this_asm_file.read().replace("\n", " ").lower()
] # Variable to hold all opcodes for a given this_asm_file
bigrams_opcodes_asm = vectorizer_opcode.transform(
corpus_opcodes_from_this_asm_file
) # Returning a sparse vector holding all bigram counts from corpus_opcodes_from_this_asm_file
# Update each row of the dataframe with the bigram counts of the respective this_asm_file
# And return a dense ndarray representation of this matrix. Because,
# CountVectorizer produces a sparse representation of the counts using scipy.sparse.csr_matrix
row = scipy.sparse.csr_matrix(bigrams_opcodes_asm).toarray()
opcodes_asm_bigram_df.write(
",".join(map(str, [this_file_id] + list(row[0])))
) # Write a single row in the CSV this_asm_file
opcodes_asm_bigram_df.write("\n")
this_asm_file.close()
opcodes_asm_bigram_df = pd.read_csv(
root_path + "featurization/opcodes_asm_bigram_df.csv"
)
opcodes_asm_bigram_df.head()
X_opcode_asm_bigram = opcodes_asm_bigram_df
y = class_labels
# X_opcode_asm_bigram.head()
#Get the best 500 features using SelectKBest.
kbest_object = SelectKBest(score_func=chi2, k=500)
top_features=kbest_object.fit(X_opcode_asm_bigram.drop("ID", axis=1), y)
# Save a dataframe with the feature scores along with the feature names.
# We will then use this dataframe to get the best features.
top_features_scores=pd.DataFrame(top_features.scores_)
# Now to get the original feature names, i.e. the names of all the columns, we will need
# `X_opcode_asm_bigram.columns`
X_opcode_columns=pd.DataFrame(X_opcode_asm_bigram.columns)
# Now concat all original features names as a column with another column
# which is "top_features_scores"
top_asm_opcode_bigram_df=pd.concat([X_opcode_columns, top_features_scores],axis=1)
# Give names to the 2 columns of this newly created dataframe
top_asm_opcode_bigram_df.columns=["ASM_Opcode_Bigram_Top_Feature_Name","ASM_Opcode_Bigram_Top_Feature_Score"]
# Extract the largest 500 from this dataframe based on the values of "top_features_scores"
top_asm_opcode_bigram_df=top_asm_opcode_bigram_df.nlargest(500,"ASM_Opcode_Bigram_Top_Feature_Score")
top_asm_opcode_bigram_df.head()
top_500_asm_bigram_features=list(top_asm_opcode_bigram_df["ASM_Opcode_Bigram_Top_Feature_Name"])
top_500_asm_bigram_df=pd.concat([X_opcode_asm_bigram["ID"], X_opcode_asm_bigram[top_500_asm_bigram_features]], axis=1)
# The "ID" column was being duplicated, hence need to remove that, and also the possibility of any other duplicated column
top_500_asm_bigram_df = top_500_asm_bigram_df.loc[:,~top_500_asm_bigram_df.columns.duplicated()]
top_500_asm_bigram_df.to_csv(root_path + "featurization/featurization_final/top_500_asm_opcodes_bigram_df.csv",index=None)
top_500_asm_bigram_df.head()
# Function to return all possible n*n*n combinations of trigrams
def calculate_trigram(tokens):
sent = ""
trigram_result = []
for i in range(len(tokens)):
for j in range(len(tokens)):
for k in range(len(tokens)):
trigram = tokens[i] + " " + tokens[j] + " " + tokens[k]
trigram_result.append(trigram)
return trigram_result
# test_tokens=['edx','esi','eax']
# trigram_result = calculate_trigram(test_tokens)
# trigram_result
opcodes_trigram = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
opcodes_trigram_asm_vocabulary = calculate_trigram(
opcodes_trigram
) # Holding all n*n*n possible combinations of trigrams_from_asm_files
vectorizer = CountVectorizer(
tokenizer=lambda x: x.split(),
lowercase=False,
ngram_range=(3, 3),
vocabulary=opcodes_trigram_asm_vocabulary,
) # NOTE: without "tokenizer=lambda x: x.split()", "??" would not get vectorized properly
file_lists_asm_opcodes = os.listdir(root_path + "opcodes_asm_files")
features = ["ID"] + vectorizer.get_feature_names()
opcodes_asm_trigram_df = pd.DataFrame(columns=features)
with open(
root_path + "featurization/opcodes_asm_trigram_df.csv", mode="w"
) as opcodes_asm_trigram_df:
opcodes_asm_trigram_df.write(",".join(map(str, features)))
opcodes_asm_trigram_df.write("\n")
for _, current_asm_textized_file in tqdm(enumerate(file_lists_asm_opcodes)):
each_file_id = current_asm_textized_file.split("_")[0]
current_asm_textized_file = open(
root_path + "opcodes_asm_files/" + current_asm_textized_file
)
corpus_for_asm_files_opcodes = [
current_asm_textized_file.read().replace("\n", " ").lower()
] # This will contain all the opcodes_trigram codes for a given current_asm_textized_file
# CountVectorizer produces a sparse representation of the counts using scipy.sparse.csr_matrix.
# Hence below is a sparse vector of all trigram counts from corpus_for_asm_files_opcodes
trigrams_from_asm_files = vectorizer.transform(corpus_for_asm_files_opcodes)
# So now return a dense ndarray representation of this matrix
# Updating each row_trigram_count of the dataframe with trigram counts
# of corresponding current_asm_textized_file
row_trigram_count = scipy.sparse.csr_matrix(trigrams_from_asm_files).toarray()
# Write that single row in the CSV for current_asm_textized_file
opcodes_asm_trigram_df.write(
",".join(map(str, [each_file_id] + list(row_trigram_count[0])))
)
opcodes_asm_trigram_df.write("\n")
current_asm_textized_file.close()
opcodes_asm_trigram_df = pd.read_csv(
root_path + "featurization/opcodes_asm_trigram_df.csv"
)
opcodes_asm_trigram_df.head()
This will be the same sequence of steps that we applied earlier for extracting the top 500 features from the ASM opcode bigrams, except that here we keep the top 800 trigram features.
%%time
X_opcode_asm_trigram = opcodes_asm_trigram_df
y = class_labels
# X_opcode_asm_trigram.head()
# Get the best 800 features using SelectKBest. Save the feature scores along with the feature names in a dataframe,
# which we will use to get the best features from the trigrams df data
kbest_object = SelectKBest(score_func=chi2, k=800)
top_features=kbest_object.fit(X_opcode_asm_trigram.drop("ID", axis=1), y)
top_features_scores=pd.DataFrame(top_features.scores_)
X_opcode_columns=pd.DataFrame(X_opcode_asm_trigram.columns)
top_asm_opcode_trigram_df=pd.concat([X_opcode_columns,top_features_scores],axis=1)
top_asm_opcode_trigram_df.columns=["ASM_Opcode_Top_Feature_Name","ASM_Opcode_Top_Feature_Score"]
top_asm_opcode_trigram_df=top_asm_opcode_trigram_df.nlargest(800,"ASM_Opcode_Top_Feature_Score")
top_asm_opcode_trigram_df.head()
%%time
# Get List of the 800 top features
top_800_asm_trigram_features=list(top_asm_opcode_trigram_df["ASM_Opcode_Top_Feature_Name"])
top_800_asm_trigam_df=pd.concat([X_opcode_asm_trigram["ID"], X_opcode_asm_trigram[top_800_asm_trigram_features]], axis=1)
# The "ID" column was being duplicated, hence need to remove that, and also the possibility of any other duplicated column
top_800_asm_trigam_df = top_800_asm_trigam_df.loc[:,~top_800_asm_trigam_df.columns.duplicated()]
top_800_asm_trigam_df.to_csv(root_path + "featurization/featurization_final/top_800_asm_opcodes_trigram_df.csv",index=None)
top_800_asm_trigam_df.head()
%%time
# Unigram of Byte Files + Size of Byte Files
uni_gram_byte_features__with_size = pd.read_csv(
root_path + "featurization/uni_gram_byte_features__with_size.csv"
)
# Top 52 Unigram of ASM Files + Size of ASM Files
# Dropping the "rtn", ".BSS:" and ".CODE" features (and the duplicate "Class" column) from unigram_asm_feature__with_size (the unigram of asm files) dataset,
# as we saw earlier that these features were not very important in separating the class labels
unigram_asm_feature__with_size = pd.read_csv(
root_path + "featurization/unigram_asm_feature__with_size"
).drop(["Class", "rtn", ".BSS:", ".CODE"], axis=1)
# Top 2000 Bi-Gram of Byte files
# top_2000_imp_byte_bigram_df = pd.read_csv(
# root_path + "featurization/featurization_final/top_2000_imp_byte_bigram_df.csv"
# ).drop(columns=["ID.1"])
top_2000_imp_byte_bigram_df = pd.read_csv(
root_path + "featurization/featurization_final/top_2000_imp_byte_bigram_df.csv"
)
# Top 500 Bigram of Opcodes of ASM Files
top_500_asm_bigram_df = pd.read_csv(root_path + "featurization/featurization_final/top_500_asm_opcodes_bigram_df.csv")
# Top 800 Trigram of Opcodes of ASM Files
top_800_asm_trigam_df = pd.read_csv(root_path + "featurization/featurization_final/top_800_asm_opcodes_trigram_df.csv")
# Top 800 ASM Image Features
top_800_image_asm_df = pd.read_csv(root_path + "featurization/top_800_image_asm_df.csv")
%%time
# Initiate a dataframe for representing the Combined Features
# and set it equal to uni_gram_byte_features__with_size
combined_features_final_df = uni_gram_byte_features__with_size
individual_featuarized_dfs = [
unigram_asm_feature__with_size,
top_800_image_asm_df,
top_2000_imp_byte_bigram_df,
top_500_asm_bigram_df,
top_800_asm_trigam_df
]
for df in tqdm(individual_featuarized_dfs):
# combined_features_final_df = pd.merge(combined_features_final_df, df, on="ID", how="left")
combined_features_final_df = pd.merge(combined_features_final_df, df, on="ID")
combined_features_final_df.to_csv(
root_path + "featurization/featurization_final/combined_features_final_df.csv",
index=None,
)
combined_features_final_df.head()
combined_features_final_df = pd.read_csv(root_path + "featurization/featurization_final/combined_features_final_df.csv")
combined_features_final_df_normalized = normalize(combined_features_final_df)
combined_features_final_df_normalized.to_csv(root_path + "featurization/featurization_final/combined_features_final_df_normalized.csv", index=None)
%%time
combined_final_for_model = pd.read_csv(root_path + "featurization/featurization_final/combined_features_final_df_normalized.csv").fillna(0)
final_y = combined_final_for_model["Class"]
# Drop the ID and the Class label so that the label does not leak into the feature matrix
final_X = combined_final_for_model.drop(['ID', 'Class'], axis=1)
# Splitting - Keep same distribution of class label 'y_true' with [stratify=final_y]
X_train, X_test_final_merged, y_train, y_test_final_merged = train_test_split(final_X, final_y, stratify=final_y, test_size=0.20, random_state=42)
X_train_final_merged, X_cv_final_merged, y_train_final_merged, y_cv_final_merged = train_test_split(X_train, y_train, stratify=y_train, test_size=0.20, random_state=42)
print('Shape of X_train_final_merged and y_train_final_merged: ', X_train_final_merged.shape, y_train_final_merged.shape)
print('Shape of X_test_final_merged and y_test_final_merged: ', X_test_final_merged.shape, y_test_final_merged.shape)
print('Shape of X_cv_final_merged and y_cv_final_merged ', X_cv_final_merged.shape, y_cv_final_merged.shape)
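A quick sanity check (purely diagnostic) that stratify has kept the class proportions roughly identical across the three splits:
print(y_train_final_merged.value_counts(normalize=True).sort_index())
print(y_cv_final_merged.value_counts(normalize=True).sort_index())
print(y_test_final_merged.value_counts(normalize=True).sort_index())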
%%time
xgb_clf=XGBClassifier()
prams={
'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
'n_estimators':[100,200,500,1000,2000],
'max_depth':[3,5,10],
'colsample_bytree':[0.1,0.3,0.5,1],
'subsample':[0.1,0.3,0.5,1],
'tree_method':['gpu_hist']
}
random_clf=RandomizedSearchCV(xgb_clf, param_distributions=prams, verbose=10, n_jobs=-1)
random_clf.fit(X_train_final_merged, y_train_final_merged)
print(random_clf.best_params_)
%%time
n_estimators = random_clf.best_params_['n_estimators']
subsample = random_clf.best_params_['subsample']
max_depth = random_clf.best_params_['max_depth']
learning_rate = random_clf.best_params_['learning_rate']
colsample_bytree = random_clf.best_params_['colsample_bytree']
tree_method = random_clf.best_params_['tree_method']
# print(tree_method)
x_clf_with_best_hyper_param=XGBClassifier(n_estimators=n_estimators, max_depth=max_depth, learning_rate= learning_rate, colsample_bytree=colsample_bytree, subsample=subsample, tree_method=tree_method, nthread=-1)
x_clf_with_best_hyper_param.fit(X_train_final_merged, y_train_final_merged, verbose=True)
sig_clf = CalibratedClassifierCV(x_clf_with_best_hyper_param, method="sigmoid")
sig_clf.fit(X_train_final_merged, y_train_final_merged)
%%time
n_estimators = random_clf.best_params_['n_estimators']
# LOGLOSS FOR TRAIN
predict_y_train = sig_clf.predict_proba(X_train_final_merged)
print ('With best number of estimators = ', n_estimators, "Our train log loss is:", log_loss(y_train_final_merged, predict_y_train))
# LOGLOSS FOR TEST
predict_y_test = sig_clf.predict_proba(X_test_final_merged)
print('For values of best number of estimators = ', n_estimators, "The test log loss is:", log_loss(y_test_final_merged, predict_y_test))
# LOGLOSS FOR CV
predict_y_cv = sig_clf.predict_proba(X_cv_final_merged)
print('With best number of estimators = ', n_estimators, "Our cross validation log loss is:", log_loss(y_cv_final_merged, predict_y_cv))
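For completeness, the per-class behaviour of this final calibrated model can be inspected with the same confusion-matrix helper used in the experimentation cells above (a sketch reusing that helper on the test split):
plot_confusion_matrix(y_test_final_merged, sig_clf.predict(X_test_final_merged))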

We could experiment further with the following features.